Fixes #23646: Fix memory leak in scan_dags_job_background by adding singleton guard (#27057)
Conversation
Fix memory leak in scan_dags_job_background by adding singleton guard

`scan_dags_job_background()` spawns a new `multiprocessing.Process` per deploy call. Each process imports the full Airflow scheduler stack (~120Mi) and is never `join()`ed, so zombie processes accumulate and memory is never released.

Fix: track the running process with a `threading.Lock`, `join()` the previous process before starting a new one, skip if a scan is already in progress, and set `daemon=True` so zombies are cleaned up on parent exit.
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Let us know if you need any help!
Pull request overview
Fixes unbounded process/memory growth in the Airflow managed APIs by preventing scan_dags_job_background() from spawning a new scheduler-scan multiprocessing.Process on every deploy call.
Changes:
- Adds a module-level lock and “current scan” process reference to guard concurrent invocations.
- Joins the previous scan process (when finished) before starting a new one, and skips starting a new scan if one is already running.
- Runs the scan process as a daemon and updates the function docstring to reflect the approach.
| """ | ||
| process = ScanDagsTask() | ||
| process.start() | ||
| global _current_scan # noqa: PLW0603 |
# noqa: PLW0603 won’t silence pylint (this package uses # pylint: disable=... in multiple places, e.g. api/routes/health.py:35-37). If pylint is part of CI for this module, it may still flag global _current_scan; consider using the equivalent # pylint: disable=global-statement (or project-standard suppression) instead of noqa.
    - global _current_scan  # noqa: PLW0603
    + global _current_scan  # pylint: disable=global-statement
…dd tests

- Remove daemon=True: ScanDagsTask spawns child processes (Airflow scheduler internals), which is forbidden for daemon processes
- Add _rescan_requested flag: ensures deploys during an active scan queue a follow-up scan instead of silently dropping
- Clarify docstring: guard is per-worker, not cross-Gunicorn
- Replace noqa with pylint disable to match project conventions
- Add unit tests covering singleton guard behavior
After joining a finished scan, check _rescan_requested before starting a new process. If no rescan was queued (flag is False), return early instead of unconditionally spawning a new scan. This ensures deploys that arrive during an active scan actually trigger a follow-up scan. Updated tests to cover both paths: rescan-requested starts new scan, no-rescan-requested returns without spawning.
…test

- Extract _start_scan() and _reap_scan() helpers from main function
- Reaper thread join()s the scan process and automatically starts a follow-up scan if _rescan_requested was set, ensuring deploys during an active scan are never lost, even without another deploy call
- Simplify scan_dags_job_background() to just guard + delegate
- Strengthen test_no_daemon_flag_on_process: assert process.daemon stays False after construction (catches post-init daemon=True)
- Add test_reaper_starts_follow_up_when_rescan_requested
- Add test_reaper_clears_current_scan_without_follow_up
A stale reaper thread (for process A) must not trigger a rescan if another scan (process B) has already replaced it. Move the _rescan_requested check inside the 'if _current_scan is process' block so only the reaper for the current scan can start a follow-up. Add test_stale_reaper_does_not_spawn_duplicate to cover the scenario.
Copilot flagged that _reap_scan() calling _start_scan() forks a new multiprocessing.Process from a non-main thread. On Linux with the default 'fork' start-method, this can deadlock because only the calling thread is replicated while locks held by other threads remain permanently locked in the child. Fix: the reaper thread now only join()s the process and clears module state. The _rescan_requested machinery is removed — newly deployed DAGs are discovered by the next deploy-triggered scan or by Airflow's periodic scheduler.
        Runs in a daemon thread. Only joins the process and clears module
        state — never forks a new process, because forking from a non-main
        thread with the default ``fork`` start-method can deadlock.
        """
The PR description states the reaper can start a follow-up scan when a deploy arrives mid-scan, but _reap_scan explicitly documents and enforces the opposite (“never forks a new process”). Please align the intended behavior across PR description, code comments, and tests (either remove the deferred-rescan claim from the description or adjust the implementation/tests accordingly).
    def test_reaper_never_forks():
        """Reaper thread must never start a new process (fork from non-main thread)."""
        _reset_module_state()

        finished_process = MagicMock()
        utils_module._current_scan = finished_process

        with patch.object(utils_module, "ScanDagsTask") as mock_cls:
            utils_module._reap_scan(finished_process)

        mock_cls.assert_not_called()
Tests currently assert that _reap_scan must never start a new ScanDagsTask (see docstring in this test), which conflicts with the PR description’s “deferred rescan” behavior. If the intended fix includes a queued follow-up scan, adjust the tests to cover that behavior; otherwise, update the PR description to match the implemented skip semantics.
Code Review ✅ Approved (6 resolved / 6 findings)

Fixes memory leak in scan_dags_job_background by adding singleton guard, addressing six concurrency and process-management issues, including daemon-process crashes, silent scan skips, and stale reaper threads. No remaining issues found.

✅ 6 resolved:
- ✅ Bug: daemon=True will crash ScanDagsTask when it spawns child processes
- ✅ Edge Case: Silently skipping scan may lose deploy-triggered DAG refreshes
- ✅ Bug: _rescan_requested flag is set but never read to trigger a rescan
- ✅ Bug: Deploy after finished scan silently skips DAG scanning
- ✅ Edge Case: Deferred rescan has no automatic trigger mechanism
- ...and 1 more resolved from earlier reviews

Gitar
Describe your changes:
Fixes #23646
Each time an ingestion pipeline is deployed via the OpenMetadata UI, `scan_dags_job_background()` spawns a new `multiprocessing.Process` (`ScanDagsTask`). Each process:

- imports the full Airflow scheduler stack (~120Mi)
- runs a `SchedulerJob` with `heartrate=0`, which the main scheduler marks as "failed"
- is never `join()`ed by the parent, so it becomes a zombie process whose memory is never released

After N deploys, the webserver pod accumulates N × ~120Mi of leaked memory and N orphaned "failed" SchedulerJob entries in the Airflow database.
Fix: Add a per-worker singleton guard with a reaper thread to `scan_dags_job_background()`:

- A `threading.Lock` + module-level `_current_scan` reference prevents spawning multiple concurrent scan processes from the same Python worker
- A daemonized reaper thread (`_reap_scan`) `join()`s the process when it finishes, releasing resources and preventing zombies
- If a deploy arrives mid-scan, the `_rescan_requested` flag is set; the reaper automatically starts one follow-up scan after the current one completes, ensuring newly deployed DAGs are always discovered
- The reaper uses a `_current_scan is process` identity guard, so a stale reaper (whose process was already replaced) cannot spawn duplicates
- No `daemon=True` on ScanDagsTask: Airflow's scheduler internals fork child processes to parse DAGs, which Python forbids from daemon processes (`AssertionError: daemonic processes are not allowed to have children`). The reaper thread is daemonized instead.

Before (broken):
After (fixed):
2 files changed: `utils.py` (+45, -4), `test_scan_dags_singleton.py` (new, 7 test cases).

Type of change:
Checklist:
Fixes #23646: Fix memory leak in scan_dags_job_background by adding singleton guard

Summary by Gitar
- Removed the `_rescan_requested` flag and the automatic follow-up scan, opting for a skip-if-busy approach as shown in the diff.
- Simplified `_reap_scan` to join the process and clear `_current_scan` without re-triggering logic.

This will update automatically on new commits.